Neural network image processing

ABSTRACT

A computer, including a processor and a memory, the memory including instructions to be executed by the processor to determine a second convolutional neural network (CNN) training dataset by determining an underrepresented object configuration and an underrepresented noise factor corresponding to an object in a first CNN training dataset, generate one or more simulated images including the object corresponding to the underrepresented object configuration in the first CNN training dataset by inputting ground truth data corresponding to the object into a photorealistic rendering engine and generate one or more synthetic images including the object corresponding to the underrepresented noise factor in the first CNN training dataset by processing the simulated images with a generative adversarial network (GAN) to determine a second CNN training dataset. The instructions can include further instructions to train a CNN using the first and the second CNN training datasets and input an image acquired by a sensor to the trained CNN and output an object label and an object location corresponding to the underrepresented object configuration and underrepresented object noise factor.

BACKGROUND

Computing devices, networks, sensors and controllers can be used to acquire data regarding the environment and identify and locate objects based on the data. Object identity and location data can be used to determine routes to be traveled and objects to be avoided by a vehicle. Operation of a vehicle can rely upon acquiring accurate and timely data regarding objects in a vehicle's environment while the vehicle is being operated on a roadway.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram of an example traffic infrastructure system.

FIG. 2 is a diagram of an example convolutional neural network.

FIG. 3 is a diagram of an example generative adversarial network.

FIG. 4 is a diagram of an example image of a traffic scene.

FIG. 5 is a diagram of another example image of a traffic scene.

FIG. 6 is a diagram of an example simulation image.

FIG. 7 is a diagram of an example synthetic image.

FIG. 8 is a flowchart diagram of an example process to determine an object label.

DETAILED DESCRIPTION

A computing device in a traffic infrastructure system can be programmed to acquire data regarding the external environment of a vehicle and to use the data to determine a vehicle path upon which to operate a vehicle in an autonomous or semi-autonomous mode. A vehicle can operate on a roadway based on a vehicle path by determining commands to direct the vehicle's powertrain, braking, and steering components to operate the vehicle to travel along the path. The data regarding the external environment can include the location of one or more objects such as vehicles and pedestrians, etc., in an environment around a vehicle and can be used by a computing device in the vehicle to operate the vehicle.

A computing device in a vehicle can be programmed to determine object labels and object locations based on image data acquired by a sensor included in the vehicle. The object labels and object locations can be based on a configuration of an object in the image data. An object label is a text string that describes or identifies an object in an image. The object can be a vehicle, for example a car, a truck, or a motorcycle. An object configuration is defined as the location, size and appearance of an object in an image. An object configuration can correspond to values and arrangements of pixels based on the object location, object orientation and appearance of the object. The appearance of the object can vary based on partial or full obscuring of the object in the image data by other objects in the image or shadows from other objects, for example. The appearance of the object in the image data can also be affected by noise factors. Noise factors can be determined for environmental conditions including full or partial sunlight, and precipitation including rain or snow, fog and dust, for example.

An image acquired by a video camera can be processed to determine a location of one or more objects, including accounting for noise factors. For example, a video camera included in a vehicle can acquire an image of a traffic scene. The acquired image can be communicated to a computing device in the vehicle. The computing device can include a software program such as a convolutional neural network (CNN) that has been trained to determine labels for objects that occur in the acquired image. A CNN will be described below in relation to FIG. 2. The CNN can also determine an object location based on the acquired image. Because the sensor is in a fixed position with regard to the vehicle and the vehicle and the object are both assumed to be positioned on a ground plane defined by a roadway, a location of an object in an image can be processed by a CNN to yield real world coordinates of the object with respect to the vehicle. The computing device can use the determined object label and object location to operate the vehicle. For example, the computing device can determine a vehicle path that avoids contact with the labeled object. A vehicle path can be a curve or curves (including a line or lines) described by one or more polynomial equations upon which the vehicle can be instructed to operate by the computing device by controlling vehicle powertrain, braking and steering. The polynomial curve can be determined to maintain upper and lower limits on lateral and longitudinal accelerations, thereby providing efficient operation of the vehicle.
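
To make the polynomial-path idea concrete, the following sketch fits a cubic polynomial to a handful of waypoints and checks the lateral acceleration it implies at an assumed speed; the waypoints, speed, and comfort limit are illustrative assumptions, not values from this disclosure.

```python
import numpy as np

# Hypothetical waypoints (meters, vehicle frame) steering laterally
# around a labeled object; none of these values come from the disclosure.
waypoints_x = np.array([0.0, 10.0, 20.0, 30.0])
waypoints_y = np.array([0.0, 0.5, 1.8, 2.0])

# One polynomial equation describing the vehicle path.
path = np.poly1d(np.polyfit(waypoints_x, waypoints_y, deg=3))

# Curvature kappa = |y''| / (1 + y'^2)^(3/2); lateral acceleration at
# speed v is v^2 * kappa, to be held below an assumed comfort limit.
x = np.linspace(0.0, 30.0, 100)
dy, d2y = np.polyder(path, 1)(x), np.polyder(path, 2)(x)
kappa = np.abs(d2y) / (1.0 + dy**2) ** 1.5
v = 10.0  # assumed speed, m/s
print("max lateral acceleration:", (v**2 * kappa).max(), "m/s^2")
```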

Software programs such as CNNs depend upon training to determine object labels in image data. A CNN can be trained by presenting the CNN with a large number (typically >1000) of training images that include objects along with corresponding ground truth. Ground truth data is data that specifies a correct solution, in this example the label and location for an object in an image, obtained from a source independent from the CNN to be trained. In this example ground truth regarding the training images can be obtained by having a human determine a label and location for an object in an image by visual inspection of the image. In other examples a human can determine the object label and object location by observing and measuring the real world traffic scene at the time the image is acquired. During training the CNN processes an input image and the result, referred to herein as an output state, is compared to the ground truth to train the CNN to produce an output state that matches the ground truth in response to inputting an image that corresponds to the ground truth.

Producing a training dataset of images, and corresponding ground truth, for training a CNN can be an expensive and time-consuming process, i.e., is inefficient and challenging in terms of consumption of computing resources. Further, a trained CNN has limited ability to generalize beyond the training dataset. A CNN can be successfully trained to correctly detect objects such as passenger vehicles, trucks and motorcycles using a dataset that includes a sample of passenger vehicles, trucks, and motorcycles, for example. Detecting an object is defined as determining a location for the object in an image by determining a bounding box for the object and determining a text string that identifies the object within the bounding box. A bounding box is a rectangle with sides parallel to the top, bottom, left and right sides of the image, respectively. A CNN trained to detect passenger vehicles, trucks, and motorcycles might not be able to successfully detect a vehicle if the image of the vehicle includes a confusing background, or the image of the vehicle is partially obscured by another vehicle, foliage, or shadows. A confusing background is defined as an image that includes features that are close enough in appearance to a vehicle that a trained CNN will generate false alarms, where a false alarm is defined as an output state that includes a labeled object where no such object occurs in the input image. Also, environmental conditions including weather, lighting, and shadows can prevent a CNN from successfully detecting an object. A CNN trainer tries to anticipate as many of the types of objects the CNN will be required to label along with as many of the environments in which the objects will occur as permitted by the time and resources available for training.

Successful training of a CNN can depend upon having sufficient examples of a vehicle type in a particular configuration. As defined above with regard to an object, a vehicle configuration in this context is defined as the location, size and appearance of a vehicle in an image. As discussed above, vehicle appearance can be based on other objects in the image including other vehicles, the roadway and objects such as overpasses, barriers, medians, etc. A sufficient number of examples can typically be greater than 100. Successful training of a CNN can also depend upon having sufficient examples (typically >100) of a vehicle type in the presence of particular noise factors. Noise factors are defined as environmental conditions that change the appearance of a vehicle in an image. These environmental conditions include rain, snow, fog, dust, etc. and lighting conditions which can range from direct sunlight to nighttime. For example, direct or bright sunlight can cause shadows that alter the appearance of vehicles in an image.

Motorcycles are an example of a vehicle object that presents difficulty in obtaining a sufficient number of images of the object (i.e., the motorcycle) in different configurations to successfully train a CNN to determine vehicle labels and locations for a motorcycle in input images. Successful training of a CNN to determine vehicle labels and locations for a motorcycle can also depend upon obtaining a sufficient number of images of each configuration with a sufficient variety of noise factors. Successful determination of a vehicle label and location for a motorcycle in a given configuration and noise factor can depend upon training the CNN with a plurality of images that include a motorcycle in a given configuration and the given noise factors. For example, successful determination of a label and location for a motorcycle in a particular configuration in the rain can depend upon training a CNN with a plurality of images of motorcycles in that configuration in the rain.

Motorcycles can present several challenges in obtaining sufficient numbers of images for training a CNN. Motorcycles are typically smaller than other vehicles and it can be difficult to obtain clear images of motorcycles due to the motorcycles being obscured by larger vehicles and their shadows. Motorcycles also tend not to be present in inclement weather situations. Some configurations and noise factors can present a danger to a motorcycle rider. For example, in some states “lane splitting”, where a motorcycle can travel between other vehicles along a lane marker separating lanes on a roadway, is legal. In other examples riding a motorcycle in an ice storm can be dangerous to the rider. Motorcycle datasets are typically collected under high-speed driving environments, which results in motion blur in the image data. This type of motion blur is hard to generate in simulated images, therefore processing simulated images to produce synthetic images can improve realism in the images. Simulated and synthetic images are defined in relation to FIGS. 6 and 7, below. Motion blur is defined in relation to FIG. 7, below.

Techniques discussed herein improve the operation of a CNN in labeling and locating objects, including but not limited to motorcycles, by simulating and synthesizing additional images of an object in configurations and noise factors underrepresented in a training dataset. Underrepresentation is defined by classifying images according to object configurations and noise factors and then comparing the number of images that include a particular configuration and/or noise factor to the average number of images for each configuration and/or noise factor. For example, if, in a training dataset, the average number of images in which any vehicle type is partially obscured is n, and the number of images that include partially obscured motorcycles is less than n, then the configuration that includes “partially obscured motorcycles” is underrepresented.
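
A minimal sketch of this underrepresentation test, assuming each training image is annotated with tags naming its object configuration and noise factors; the tag vocabulary shown is hypothetical:

```python
from collections import Counter

def underrepresented_categories(annotations):
    """Return configuration/noise-factor tags whose image counts fall
    below the average count across all tags in the dataset.
    `annotations` is a list of per-image tag lists."""
    counts = Counter(tag for tags in annotations for tag in tags)
    average = sum(counts.values()) / len(counts)
    return [tag for tag, n in counts.items() if n < average]

# Example: "motorcycle_lane_splitting" appears once vs. an average of 2
# images per tag, so it is flagged as underrepresented.
example = [
    ["car_clear"], ["car_clear"], ["car_clear"],
    ["truck_clear"], ["truck_clear"], ["motorcycle_lane_splitting"],
]
print(underrepresented_categories(example))  # -> ['motorcycle_lane_splitting']
```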

Disclosed herein is a method, including determining a second convolutional neural network (CNN) training dataset by determining an underrepresented object configuration and an underrepresented noise factor corresponding to an object in a first CNN training dataset, generating one or more simulated images including the object corresponding to the underrepresented object configuration in the first CNN training dataset by inputting ground truth data corresponding to the object into a photorealistic rendering engine, generating one or more synthetic images including the object corresponding to the underrepresented noise factor in the first CNN training dataset by processing the simulated images with a generative adversarial network (GAN) to determine a second CNN training dataset, and training the CNN using the first and the second CNN training datasets. An image acquired by a sensor can be input to the trained CNN, which can output an object label and an object location corresponding to the object corresponding to the underrepresented object configuration and underrepresented object noise factor. A second underrepresented noise factor can be determined that corresponds to the CNN not outputting the object label and the object location corresponding to the object by testing the CNN with a test dataset based on real world images, one or more synthetic images including the object corresponding to the second underrepresented noise factor can be generated to determine a third CNN training dataset by inputting images from the first training dataset and the second training dataset into the GAN, and the CNN can be retrained using the third training dataset to output a second object label and a second object location corresponding to the second underrepresented noise factor using the first, second and third CNN training datasets. The first object configuration can be underrepresented when a number of images in the first CNN training dataset that include the first object configuration is less than an average number of images that include each other object configuration. The first noise factor can be underrepresented when a number of images in the first CNN training dataset that include the first noise factor is less than an average number of images that include each other noise factor.

The first and second CNN training datasets can include images that include the object and corresponding ground truth that includes the object label and the object location for the object included in the images. Training a second CNN with the second CNN training dataset can reduce false positives output by a first CNN trained with the first CNN training dataset when processing real-world data, wherein a false positive is an object label incorrectly applied to an object occurring in an image. The object label can be a text string that identifies an object included in an input image and the object location is a bounding box corresponding to the object included in the input image. Object configurations can include values and an arrangement of pixels corresponding to the object based on an object location, an object orientation, and partial obscuring of the object by another object and/or another object's shadow. Noise factors can include values and an arrangement of pixels corresponding to the object based on environmental conditions including partial or full sunlight, precipitation including rain or snow, fog, and dust. The photorealistic rendering engine can be a software program that inputs ground truth data regarding object configuration and outputs an image of a traffic scene that appears as if it were acquired by a real-world camera. The GAN can be trained to generate simulated images by training the GAN with real-world images that include noise factors. Training the GAN with real-world images can include comparing an image output by the GAN with input real-world images to determine similarity between the image output by the GAN with input real-world images by correlating the output images with the input real-world images. The CNN can include convolutional layers that output hidden variables to fully-connected layers that output states that include the object label and object location. The CNN can be trained to output states corresponding to the object label and object location by processing an input image a plurality of times and comparing the output states to ground truth data corresponding to the input image.

Further disclosed is a computer readable medium, storing program instructions for executing some or all of the above method steps. Further disclosed is a computer programmed for executing some or all of the above method steps, including a computer apparatus, programmed to determine a second convolutional neural network (CNN) training dataset by determining an underrepresented object configuration and an underrepresented noise factor corresponding to an object in a first CNN training dataset, generate one or more simulated images including the object corresponding to the underrepresented object configuration in the first CNN training dataset by inputting ground truth data corresponding to the object into a photorealistic rendering engine, generate one or more synthetic images including the object corresponding to the underrepresented noise factor in the first CNN training dataset by processing the simulated images with a generative adversarial network (GAN) to determine a second CNN training dataset, and train the CNN using the first and the second CNN training datasets. An image acquired by a sensor can be input to the trained CNN, which can output an object label and an object location corresponding to the object corresponding to the underrepresented object configuration and underrepresented object noise factor. A second underrepresented noise factor can be determined that corresponds to the CNN not outputting the object label and the object location corresponding to the object by testing the CNN with a test dataset based on real world images, one or more synthetic images including the object corresponding to the second underrepresented noise factor can be generated to determine a third CNN training dataset by inputting images from the first training dataset and the second training dataset into the GAN, and the CNN can be retrained using the third training dataset to output a second object label and a second object location corresponding to the second underrepresented noise factor using the first, second and third CNN training datasets. The first object configuration can be underrepresented when a number of images in the first CNN training dataset that include the first object configuration is less than an average number of images that include each other object configuration. The first noise factor can be underrepresented when a number of images in the first CNN training dataset that include the first noise factor is less than an average number of images that include each other noise factor.

The computer can be further programmed to include in the first and second CNN training datasets images that include the object and corresponding ground truth that includes the object label and the object location for the object included in the images. Training a second CNN with the second CNN training dataset can reduce false positives output by a first CNN trained with the first CNN training dataset when processing real-world data, wherein a false positive is an object label incorrectly applied to an object occurring in an image. The object label can be a text string that identifies an object included in an input image and the object location is a bounding box corresponding to the object included in the input image. Object configurations can include values and an arrangement of pixels corresponding to the object based on an object location, an object orientation, and partial obscuring of the object by another object and/or another object's shadow. Noise factors can include values and an arrangement of pixels corresponding to the object based on environmental conditions including partial or full sunlight, precipitation including rain or snow, fog, and dust.

The photorealistic rendering engine can be a software program that inputs ground truth data regarding object configuration and outputs an image of a traffic scene that appears as if it were acquired by a real-world camera. The GAN can be trained to generate simulated images by training the GAN with real-world images that include noise factors. Training the GAN with real-world images can include comparing an image output by the GAN with input real-world images to determine similarity between the image output by the GAN with input real-world images by correlating the output images with the input real-world images. The CNN can include convolutional layers that output hidden variables to fully-connected layers that output states that include the object label and object location. The CNN can be trained to output states corresponding to the object label and object location by processing an input image a plurality of times and comparing the output states to ground truth data corresponding to the input image.

FIG. 1 is a diagram of a traffic infrastructure system 100 that includes a vehicle 110 operable in autonomous (“autonomous” by itself in this disclosure means “fully autonomous”), semi-autonomous, and occupant piloted (also referred to as non-autonomous) mode. One or more vehicle 110 computing devices 115 can receive data regarding the operation of the vehicle 110 from sensors 116. The computing device 115 may operate the vehicle 110 in an autonomous mode, a semi-autonomous mode, or a non-autonomous mode.

The computing device 115 includes a processor and a memory such as are known. Further, the memory includes one or more forms of computer-readable media, and stores instructions executable by the processor for performing various operations, including as disclosed herein. For example, the computing device 115 may include programming to operate one or more of vehicle brakes, propulsion (e.g., control of acceleration in the vehicle 110 by controlling one or more of an internal combustion engine, electric motor, hybrid engine, etc.), steering, climate control, interior and/or exterior lights, etc., as well as to determine whether and when the computing device 115, as opposed to a human operator, is to control such operations.

The computing device 115 may include or be communicatively coupled to, e.g., via a vehicle communications bus as described further below, more than one computing device, e.g., controllers or the like included in the vehicle 110 for monitoring and/or controlling various vehicle components, e.g., a powertrain controller 112, a brake controller 113, a steering controller 114, etc. The computing device 115 is generally arranged for communications on a vehicle communication network, e.g., including a bus in the vehicle 110 such as a controller area network (CAN) or the like; the vehicle 110 network can additionally or alternatively include wired or wireless communication mechanisms such as are known, e.g., Ethernet or other communication protocols.

Via the vehicle network, the computing device 115 may transmit messages to various devices in the vehicle and/or receive messages from the various devices, e.g., controllers, actuators, sensors, etc., including sensors 116. Alternatively, or additionally, in cases where the computing device 115 actually comprises multiple devices, the vehicle communication network may be used for communications between devices represented as the computing device 115 in this disclosure. Further, as mentioned below, various controllers or sensing elements such as sensors 116 may provide data to the computing device 115 via the vehicle communication network.

In addition, the computing device 115 may be configured for communicating through a vehicle-to-infrastructure (V-to-I) interface 111 with a remote server computer 120, e.g., a cloud server, via a network 130, which, as described below, includes hardware, firmware, and software that permits computing device 115 to communicate with a remote server computer 120 via a network 130 such as wireless Internet (WI-FI®) or cellular networks. V-to-I interface 111 may accordingly include processors, memory, transceivers, etc., configured to utilize various wired and/or wireless networking technologies, e.g., cellular, BLUETOOTH® and wired and/or wireless packet networks. Computing device 115 may be configured for communicating with other vehicles 110 through V-to-I interface 111 using vehicle-to-vehicle (V-to-V) networks, e.g., according to Dedicated Short Range Communications (DSRC) and/or the like, e.g., formed on an ad hoc basis among nearby vehicles 110 or formed through infrastructure-based networks. The computing device 115 also includes nonvolatile memory such as is known. Computing device 115 can log data by storing the data in nonvolatile memory for later retrieval and transmittal via the vehicle communication network and a vehicle to infrastructure (V-to-I) interface 111 to a server computer 120 or user mobile device 160.

As already mentioned, generally included in instructions stored in the memory and executable by the processor of the computing device 115 is programming for operating one or more vehicle 110 components, e.g., braking, steering, propulsion, etc., without intervention of a human operator. Using data received in the computing device 115, e.g., the sensor data from the sensors 116, the server computer 120, etc., the computing device 115 may make various determinations and/or control various vehicle 110 components and/or operations without a driver to operate the vehicle 110. For example, the computing device 115 may include programming to regulate vehicle 110 operational behaviors (i.e., physical manifestations of vehicle 110 operation) such as speed, acceleration, deceleration, steering, etc., as well as tactical behaviors (i.e., control of operational behaviors typically in a manner intended to achieve efficient traversal of a route) such as a distance between vehicles and/or amount of time between vehicles, lane-change, minimum gap between vehicles, left-turn-across-path minimum, time-to-arrival at a particular location and intersection (without signal) minimum time-to-arrival to cross the intersection.

Controllers, as that term is used herein, include computing devices that typically are programmed to monitor and/or control a specific vehicle subsystem. Examples include a powertrain controller 112, a brake controller 113, and a steering controller 114. A controller may be an electronic control unit (ECU) such as is known, possibly including additional programming as described herein. The controllers may be communicatively connected to and receive instructions from the computing device 115 to actuate the subsystem according to the instructions. For example, the brake controller 113 may receive instructions from the computing device 115 to operate the brakes of the vehicle 110.

The one or more controllers 112, 113, 114 for the vehicle 110 may include known electronic control units (ECUs) or the like including, as non-limiting examples, one or more powertrain controllers 112, one or more brake controllers 113, and one or more steering controllers 114. Each of the controllers 112, 113, 114 may include respective processors and memories and one or more actuators. The controllers 112, 113, 114 may be programmed and connected to a vehicle 110 communications bus, such as a controller area network (CAN) bus or local interconnect network (LIN) bus, to receive instructions from the computing device 115 and control actuators based on the instructions.

Sensors 116 may include a variety of devices known to provide data via the vehicle communications bus. For example, a radar fixed to a front bumper (not shown) of the vehicle 110 may provide a distance from the vehicle 110 to a next vehicle in front of the vehicle 110, or a global positioning system (GPS) sensor disposed in the vehicle 110 may provide geographical coordinates of the vehicle 110. The distance(s) provided by the radar and/or other sensors 116 and/or the geographical coordinates provided by the GPS sensor may be used by the computing device 115 to operate the vehicle 110 autonomously or semi-autonomously, for example.

The vehicle 110 is generally a land based vehicle 110 capable of autonomous and/or semi-autonomous operation and having three or more wheels, e.g., a passenger car, light truck, etc. The vehicle 110 includes one or more sensors 116, the V-to-I interface 111, the computing device 115 and one or more controllers 112, 113, 114. The sensors 116 may collect data related to the vehicle 110 and the environment in which the vehicle 110 is operating. By way of example, and not limitation, sensors 116 may include, e.g., altimeters, cameras, LIDAR, radar, ultrasonic sensors, infrared sensors, pressure sensors, accelerometers, gyroscopes, temperature sensors, hall sensors, optical sensors, voltage sensors, current sensors, mechanical sensors such as switches, etc. The sensors 116 may be used to sense the environment in which the vehicle 110 is operating, e.g., sensors 116 can detect phenomena such as weather conditions (precipitation, external ambient temperature, etc.), the grade of a road, the location of a road (e.g., using road edges, lane markings, etc.), or locations of target objects such as neighboring vehicles 110. The sensors 116 may further be used to collect data including dynamic vehicle 110 data related to operations of the vehicle 110 such as velocity, yaw rate, steering angle, engine speed, brake pressure, oil pressure, the power level applied to controllers 112, 113, 114 in the vehicle 110, connectivity between components, and accurate and timely performance of components of the vehicle 110.

Traffic infrastructure system 100 can include one or more edge computing nodes 170. Edge computing nodes 170 are computing devices as described above that are located near roadways, and can be in communication with stationary or moveable sensors 180. For example, a sensor 180 can be a stationary video camera attached to a pole 190, building, or other stationary structure to give the sensor 180 a view of traffic. Mobile sensors 180 can be mounted on drones or other mobile platforms to provide views of traffic from positions not available to stationary sensors. Edge computing nodes 170 further can be in communication with computing devices 115 in vehicle 110, server computers 120, and user mobile devices 160 such as smart phones. Server computers 120 can be cloud-based computer resources that can be called upon by edge computing nodes 170 to provide additional computing resources when needed.

Vehicles can be equipped to operate in both autonomous and occupant piloted mode. By a semi- or fully-autonomous mode, we mean a mode of operation wherein a vehicle can be piloted partly or entirely by a computing device as part of a system having sensors and controllers. The vehicle can be occupied or unoccupied, but in either case the vehicle can be partly or completely piloted without assistance of an occupant. For purposes of this disclosure, an autonomous mode is defined as one in which each of vehicle propulsion (e.g., via a powertrain including an internal combustion engine and/or electric motor), braking, and steering are controlled by one or more vehicle computers; in a semi-autonomous mode the vehicle computer(s) control(s) one or more of vehicle propulsion, braking, and steering. In a non-autonomous mode, none of these are controlled by a computer.

FIG. 2 is a diagram of a CNN 200. A CNN 200 is a software program that can execute on a computing device 115 included in a vehicle 110. A CNN 200 inputs and processes an image 202. An image 202 can be acquired by a video camera included as a sensor 116 in a vehicle 110, for example. An image 202 can also be acquired by a sensor 180, which can be a video camera, included in a traffic infrastructure system 100 and communicated to the CNN 200 in a computing device 115 by an edge computing node 170 via a network 130. CNN 200 processes an input image 202 and produces output states (STATES) 210 that include an object label and an object location for the object label.

A CNN includes convolutional layers (CONV) 204. Convolutional layers 204 use a plurality of convolutional kernels to reduce an input image 202 to hidden variables (HV) 206. The hidden variables 206 are an encoded representation of the input image 202 that includes data corresponding to objects in the input image 202. The hidden variables 206 are input to fully connected layers (FULL) 208 that process the hidden variables 206 to produce output states 210 corresponding to an object label and an object location for the object label. A CNN 200 can be trained to output an object label and an object location for the object label by processing hidden variables 206 with fully connected layers 208. A CNN 200 can output more than one object label, each with a corresponding object location, for each object occurring in an input image 202.
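
A minimal PyTorch sketch of this structure follows: convolutional layers reduce the image to hidden variables, which fully connected layers map to output states (class scores for the object label and a box for the object location). The layer sizes and number of classes are illustrative assumptions, not the architecture of this disclosure.

```python
import torch
import torch.nn as nn

class ObjectCNN(nn.Module):
    """Sketch of the CNN 200 structure; all sizes are illustrative."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.conv = nn.Sequential(          # convolutional layers 204
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        self.fc = nn.Sequential(            # fully connected layers 208
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 128), nn.ReLU(),
        )
        self.label_head = nn.Linear(128, num_classes)  # object label scores
        self.box_head = nn.Linear(128, 4)              # bounding box x, y, w, h

    def forward(self, image):
        hidden = self.fc(self.conv(image))  # hidden variables 206
        return self.label_head(hidden), self.box_head(hidden)

scores, box = ObjectCNN()(torch.zeros(1, 3, 256, 256))  # smoke test
```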

A CNN 200 can be trained to input an image 202 and output states 210 including object labels and object locations using a training dataset that includes training images 202 that include objects and ground truth corresponding to the objects included in the images. As defined above, ground truth includes labels for the objects obtained independently from the CNN 200. For example, a human can view the training images 202 in the training dataset and determine object labels. As defined above, an object label is a text string corresponding to a bounding box that locates an object in an image. The location of the bounding box in the image can be processed to yield a location for the object in real world coordinates. An object location corresponding to the object label can be determined by measuring the location of the object in real world coordinates in the real-world traffic scene corresponding to the training image 202. Real world coordinates specify a location in the real, i.e., physical, world, and typically are three dimensional coordinates measured with respect to a global coordinate system such as latitude, longitude, and altitude. The object location can also be estimated using photogrammetry techniques on the training image. Photogrammetry is a technique that uses measurements in pixels of locations of objects in an image, and data regarding real-world measurements of objects such as height and width of a make and model of vehicle, to determine real-world locations of objects in an image.
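
The pinhole-camera relation underlying such a photogrammetric estimate can be sketched in a few lines; the object height, pixel measurement, and focal length below are illustrative assumptions:

```python
def photogrammetric_range(real_height_m, pixel_height, focal_length_px):
    """Estimate range to an object from its apparent size using the
    pinhole-camera relation pixel_height = focal * real_height / range,
    i.e., range = focal * real_height / pixel_height."""
    return focal_length_px * real_height_m / pixel_height

# A motorcycle roughly 1.2 m tall spanning 60 px in an image from a
# camera with an 800 px focal length (assumed values):
print(photogrammetric_range(1.2, 60, 800))  # -> 16.0 meters
```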

A training image 202 can be input to a CNN 200 and a plurality of tests can be run while varying the parameters used to program the convolutional layers 204 and fully connected layers 208. After each run of a plurality of tests the output states 210 are back-propagated to compare with the ground truth. Back-propagation is a technique that returns output states 210 from a CNN 200 to the input to be compared to ground truth corresponding to the input image 202. In this example, during training a label and a location can be back-propagated to be compared to the label and location included in the ground truth to determine a loss function. The loss function determines how accurately the CNN has processed the input image 202. A CNN 200 can be executed a plurality of times on a single input image 202 while varying parameters that control the processing of the CNN 200. Parameters that correspond to correct answers as confirmed by a loss function that compares the output states 210 to the ground truth are saved as candidate parameters. Following the test runs, the candidate parameters that produce the most correct results are saved as the parameters that will be used to program the CNN 200 during operation.
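
A hedged sketch of one such training iteration, reusing the ObjectCNN sketch above; the particular loss functions (cross-entropy for the label, L1 for the bounding box) are assumed choices, since the disclosure does not specify them:

```python
import torch
import torch.nn.functional as F

def train_step(cnn, optimizer, image, gt_label, gt_box):
    """One iteration: forward pass, compare output states to ground
    truth via a loss function, back-propagate, update parameters."""
    scores, box = cnn(image)
    loss = F.cross_entropy(scores, gt_label) + F.l1_loss(box, gt_box)
    optimizer.zero_grad()
    loss.backward()      # back-propagation of the ground-truth comparison
    optimizer.step()
    return loss.item()

# Hypothetical usage with the ObjectCNN sketch above:
# cnn = ObjectCNN()
# opt = torch.optim.Adam(cnn.parameters(), lr=1e-3)
# train_step(cnn, opt, image, torch.tensor([2]),
#            torch.tensor([[0.4, 0.5, 0.1, 0.2]]))
```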

FIG. 3 is a diagram of a generative adversarial network (GAN) 300. A GAN is a neural network that can be trained to input data, for example an image (IMAGE) 302 and output 312 an output image (OUT) 306 that has been processed by the generative layers (GEN) 304 of GAN 300 to alter the appearance of the input image 302. For example, a GAN 300 can be trained to input images 302 that include objects in full sunlight. A GAN 300 can be trained to alter the input images 302 to appear as if they were acquired in environmental conditions including shadows, dim lighting, nighttime, rain, snow, fog, etc. A GAN 300 can be used to generate output images 306 that correspond to underrepresented noise factors in a training dataset for a CNN 200. For example, if images including an object traveling on a roadway in the rain are underrepresented in a training dataset, images of the object traveling on a roadway in clear conditions can be input into an appropriately trained GAN 300 to produce images of the object traveling on a roadway while it is raining. In this fashion, images corresponding to underrepresented noise factors can be added to a training dataset to train a CNN.

A GAN 300 can be trained to produce output images 306 that include a selected noise factor by first training a discriminator (DISC) 308 to discriminate between images that include the selected noise factor and images that do not include the selected noise factor. A discriminator 308 is trained by inputting training images (TI) 314, a portion of which include the selected noise factor, along with ground truth (“yes” or “no”) that indicates whether each training image 314 includes the selected noise factor. In this fashion discriminator 308 is trained to determine an output value (OUT) 310 that indicates whether an output image 306 input to the discriminator includes the selected noise factor or not.

In examples where a discriminator 308 is trained with real world images as training images 314, the output value 310 can be regarded as a determination as to whether the output image 306 input to the discriminator 308 is “real” or “fake” by training the discriminator 308 to emit an output value 310 of “real” for input training images 314 that include the selected noise factor and “fake” for input training images 314 that do not include the noise factor. “Real” and “fake” are binary values corresponding to logical “true” and “false” or 1 and 0. In this context, “real” refers to images that appear similar to the real world images used to train the discriminator 308 and “fake” refers to an output image that does not appear similar to a real world image. For training purposes, “real” and “fake” images are determined by human observers.

The generator 304 can be trained to generate output images 306 that appear “real” and the discriminator 308 is trained to distinguish between “real” and “fake” training images 314 in a stepwise simultaneous fashion. In stepwise simultaneous training, the generator 304 can be trained to generate output images 306 that are determined by the discriminator 308 to be “real” in response to input images 302 that do not include the selected noise factor by dividing the training dataset into a plurality of epochs or portions. The generator 304 is trained on one or more epochs of the training dataset while the discriminator 308 is held constant. The discriminator 308 is then trained on one or more epochs of the training dataset while the generator 304 is held constant. This process is repeated until the GAN 300 converges, i.e., until the generator 304 produces images that are correctly determined to be “real” by the discriminator 308 based on ground truth at a rate that is determined to be acceptable by a user, for example 90%. For example, a GAN 300 can be trained to input an input image 302 that includes a daylight, summer scene and output 312 an output image 306 that includes the same scene altered to look as if the image were acquired at night, in the winter, in a snowstorm.
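
The alternating schedule can be sketched as follows, assuming the generator and discriminator are given as PyTorch modules (the discriminator outputting one logit per image) and `loader` yields paired clear and noise-factor images; the loss formulation is the standard GAN objective, an assumption rather than this disclosure's specification:

```python
import torch
import torch.nn as nn

def train_gan(generator, discriminator, loader, epochs_per_phase=1, rounds=10):
    """Stepwise simultaneous training: alternate phases in which one
    network is trained while the other is held constant."""
    bce = nn.BCEWithLogitsLoss()
    g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
    for _ in range(rounds):
        # Generator phase: discriminator held constant.
        for _ in range(epochs_per_phase):
            for clear_img, noisy_img in loader:
                fake = generator(clear_img)
                # Generator is rewarded when the discriminator says "real" (1).
                g_loss = bce(discriminator(fake), torch.ones(fake.size(0), 1))
                g_opt.zero_grad(); g_loss.backward(); g_opt.step()
        # Discriminator phase: generator held constant (outputs detached).
        for _ in range(epochs_per_phase):
            for clear_img, noisy_img in loader:
                fake = generator(clear_img).detach()
                d_loss = (bce(discriminator(noisy_img),
                              torch.ones(noisy_img.size(0), 1))
                          + bce(discriminator(fake),
                                torch.zeros(fake.size(0), 1)))
                d_opt.zero_grad(); d_loss.backward(); d_opt.step()
```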

FIG. 4 is a diagram of an image 400 of a traffic scene 402. Image 400 can be acquired by a camera included in a vehicle 110, for example a rear-facing camera. Traffic scene 402 includes a roadway 404, and on the roadway 404 a motorcycle 406. The motorcycle is lane splitting between a vehicle 408 and a large truck 410. In traffic scenes like traffic scene 402, a motorcycle 406 can be maneuvering at higher speeds than the surrounding traffic, including vehicle 408 and large truck 410. Because of the higher speeds and sometimes unpredictable motion of vehicles like motorcycle 406, a CNN 200 executing on a computing device 115 in a vehicle 110 should label and locate the motorcycle 406 as soon as possible to permit computing device 115 to determine if a response is necessary and execute the response. A response to the presence of a motorcycle 406 might be to move over within a lane to permit the motorcycle to lane split next to the vehicle, for example.

Motorcycle labeling and location using a CNN 200 requires large amounts of training data, typically including >1000 images, with diversity in motorcycle shapes and sizes, occlusion types (neighboring cars, buses, trucks), shadows (from neighboring trees, vehicles etc.), ground types (cement, gravel, asphalt), weather conditions (dry, wet, snowy grounds), etc. As discussed above, collecting such diverse real-world datasets is costly and resource-intensive. This is why open-source CNN 200 training datasets typically have a small percentage of motorcycles in them. For example, in the widely popular KITTI dataset, available from the Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany, only 0.7% of the objects are motorcycles. Similarly, the more recent BDD100K dataset, available from Berkeley Artificial Intelligence Research, University of California, Berkeley, Berkeley, Calif., is much bigger and claims to be more diverse but has only 0.04% of its objects as motorcycles. Neither of these datasets includes enough images to be able to train CNN 200 to be a robust motorcycle detector. Acquiring a real-world dataset of motorcycle images and corresponding ground truth data sufficient to train a CNN 200 could require thousands of man hours.

As discussed above, real-world training dataset distributions are often skewed towards vehicle configurations and noise factors that are easy to collect. This can lead to a CNN 200 performing poorly in scenarios that are not well-represented in the training dataset, which for the task of motorcycle detection also happen to be the ones that are most pertinent. Furthermore, datasets are typically divided into training data and testing data portions. A CNN 200 is trained using the training dataset and then tested using the testing dataset based on real world images and ground truth to determine accuracy metrics such as the percentages of true and false positive labeling. A true positive is defined as a correct location or object label corresponding to the ground truth and a false positive is defined as an incorrect location or object label corresponding to the ground truth. The CNN 200 not outputting the object label and the object location corresponding to the object when testing the CNN 200 is defined as a true positive rate of less than 90% or a false positive rate of greater than 10%. Additionally, accurate ground truth is essential for reliable performance of any supervised learning algorithm. Ground truth annotation of real-world data is currently done by humans using dedicated software tools. This process is both error-prone and time-consuming.
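
A minimal sketch of such an accuracy check, simplified to label matching (a full evaluation would also match bounding boxes, e.g., by intersection-over-union):

```python
def detection_rates(results):
    """Compute true/false positive rates from (predicted, ground_truth)
    label pairs for matched detections."""
    tp = sum(1 for pred, gt in results if pred == gt)
    tp_rate = tp / len(results)
    fp_rate = 1.0 - tp_rate
    # Per the criterion above: failing means a true positive rate below
    # 90% or a false positive rate above 10%.
    return tp_rate, fp_rate, tp_rate < 0.90 or fp_rate > 0.10

print(detection_rates([("motorcycle", "motorcycle"), ("car", "motorcycle")]))
# -> (0.5, 0.5, True): this detector would need retraining
```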

Techniques described herein improve the training of a CNN 200 to label and locate objects that are underrepresented in training datasets by generating simulated images with photorealistic rendering software that include objects in underrepresented configurations and then processing the simulated images with a GAN 300 to produce synthetic images that include a plurality of noise factors. Simulated images can be generated using photorealistic rendering software. Photorealistic rendering software is also referred to as game engine software because it is also used to generate images for video game software. A photorealistic image is defined as an image that appears almost as if it were acquired by a real world camera viewing a real world scene. Example photorealistic rendering software is aDRIVE, a software program developed at Ford Greenfield Labs, Palo Alto, Calif. 94304. The aDRIVE software can be used to generate realistic views of traffic scenes including simulated video sequences to aid in development of automated vehicle software, for example. For purposes of training a CNN 200 as discussed herein, simulated images produced by photorealistic rendering software are not sufficiently realistic to produce training images that include complex configurations of objects such as motorcycles with a plurality of different noise factors. Techniques discussed herein improve CNN 200 training datasets by processing simulated images with a GAN 300 to produce synthetic images that include realistic object configurations with a plurality of noise factors.

Photorealistic rendering software can input a scene description that includes a location and size of a roadway and the location and types of foliage and buildings around the roadway. The scene description also includes the location, size and types of vehicles on or around the roadway. The scene description also includes the intensity, location and type of lighting for the scene. For example, a scene description can include a three-lane highway with grass in the median and low bushes next to the highway, a large truck in the left-most lane, a van in the center lane and a motorcycle between the large truck and the van. The scene can be an evening daylight scene with a shadow from the large truck partially obscuring the motorcycle. This scene description can be input to photorealistic rendering software to produce an image similar to image 400. Because input to the photorealistic rendering software includes the location and size of vehicles and other objects such as barriers and lane markers, ground truth corresponding to the output image is available for training a CNN.
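
A scene description of the kind described above might look like the following; the field names and structure are hypothetical illustrations and do not reflect aDRIVE's actual input format:

```python
# Hypothetical scene description; every field name here is illustrative.
scene = {
    "roadway": {"lanes": 3, "surface": "asphalt", "median": "grass"},
    "vegetation": [{"type": "low_bushes", "location": "right_shoulder"}],
    "vehicles": [
        {"type": "large_truck", "lane": 0, "position_m": 25.0},
        {"type": "van",         "lane": 1, "position_m": 22.0},
        {"type": "motorcycle",  "lane_offset": 0.5, "position_m": 23.0},
    ],
    "lighting": {"sun_elevation_deg": 10, "condition": "evening_daylight"},
}
# Because object types, sizes and positions are specified in the scene
# description, bounding-box ground truth for the rendered image follows
# directly from it.
```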

Images simulated by the photorealistic rendering software, without significant investment in time, money and computing resources on graphics expertise, typically lack the realism that can be used to successfully train a CNN 200. Thus, a GAN 300 can be used to automatically and efficiently translate simulated images into more photorealistic images. Additionally, a GAN 300 can be used to diversify real images by adding a plurality of noise factors. For example, the photorealistic rendering software can produce images that appear as if they were acquired in clear weather at various times of the day or night. A GAN can be trained to input the photorealistic images and output images that appear as if they were acquired in rain that can vary from light to heavy, in snow that can vary from light to heavy in various seasons and with various levels of lighting including artificial lights, i.e., streetlights. Synthetic images produced in this fashion can be used to augment and improve the diversity of real-world datasets specifically for training, validation and testing of CNN-based motorcycle detection.

FIG. 5 is a diagram of an image 500 of a traffic scene 502 that includes a roadway 504 and a motorcycle 506. Image 500 can be acquired by a camera included in a vehicle 110. As discussed above, real world images that include a motorcycle 506 in a given configuration are uncommon in existing training datasets. Acquiring a real world image like image 500 can require renting a motorcycle 506, hiring a rider and then arranging the motorcycle 506 and one or more additional vehicles 110 in various configurations. Adding in acquiring real world images and acquiring ground truth with a variety of noise factors can greatly increase the time, expense and resources involved in acquiring the image data and corresponding ground truth.

FIG. 6 is a diagram of a simulated image 600 of a traffic scene 602 that includes a roadway 604 and a motorcycle 606. Simulated image 600 can be generated by photorealistic rendering software such as aDRIVE, discussed above in relation to FIG. 4. The photorealistic rendering software can generate a simulated image 600 of a traffic scene 602 based on scene descriptions that include the location and size of a roadway 604 and the location and size of a motorcycle 606 described in relation to the roadway 604. Because the simulated images 600 are generated from scene descriptions that include the location and size of the objects in the simulated image 600, ground truth corresponding to the image is included. A plurality of simulated images 600 including a motorcycle 606 in a variety of different configurations and including a variety of other objects including vehicles can be generated more quickly and less expensively than acquiring real world images of a motorcycle 606.

FIG. 7 is a diagram of an image 700 of a traffic scene 702 that includes a roadway 704 and a motorcycle 706. Image 700 is a synthetic image 700 generated by inputting a simulated image 600 into a GAN 300 that has been trained to add realistic noise factors to simulated images 600. Adding noise factors to a simulated image 600 to form a synthetic image 700 can make the simulated image 600 appear more realistic and add environmental conditions to the simulated image 600. For example, synthetic image 700 includes motion blur (dotted lines) 708 and rain (dashed lines) 710. Motion blur is an optical phenomenon present in image data caused by motion of the camera or object being captured. If a particular point is moving with respect to the image plane of the sensor while a camera shutter is open, or while a video frame is being acquired, the point will create a streak or blur in the image. A GAN 300 can also be used to add realistic noise factors to real world training images. For example, real world training images acquired in daytime can be processed by a GAN 300 to appear as if they were acquired at nighttime. A plurality of simulated images 600 can be generated corresponding to underrepresented vehicle configurations and then processed to form synthetic images 700 that include a plurality of noise factors corresponding to underrepresented noise factors. Because the synthetic images 700 include corresponding ground truth, photorealistic synthetic images 700 including noise factors can be generated at much less expense and in much less time than acquiring real world images to train a CNN 200.
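
Motion blur of this kind can be approximated by convolving an image with a one-dimensional averaging kernel aligned with the direction of apparent motion, as in this sketch; the blur length and horizontal direction are assumptions:

```python
import numpy as np
from scipy.ndimage import convolve

def horizontal_motion_blur(image, length=9):
    """Approximate motion blur: each pixel is smeared along the assumed
    (horizontal) direction of motion by a 1 x `length` averaging kernel,
    producing the streaking described above."""
    kernel = np.ones((1, length)) / length
    return convolve(image, kernel, mode="nearest")

blurred = horizontal_motion_blur(np.random.rand(64, 64))  # grayscale example
```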

Synthetic images 700 can be used to generate a second training dataset for retraining a CNN 200 to correctly output object labels and object locations for object configurations and noise factors that are underrepresented in training datasets based solely on real world images. As discussed above, object configurations corresponding to motorcycles can be underrepresented in training datasets for a CNN 200. Underrepresented object configurations with underrepresented noise factors can be remedied by generating simulated images that include the underrepresented object configurations and processing the simulated images with a GAN 300 to produce synthetic images that include underrepresented object configurations with underrepresented noise factors. For example, images that include motorcycles lane splitting in the rain can be underrepresented in a training dataset based on real world images. Techniques described herein remedy the underrepresented object configurations and noise factors more efficiently, thereby saving time and expense, by using photorealistic rendering software and GANs 300 to produce synthetic images to be included in a second training dataset. A CNN 200 retrained using the second training dataset in addition to the first training dataset will be more likely to produce correct object labels and correct object locations in response to real world input images that include motorcycles in object configurations with noise factors underrepresented in the first training dataset that only includes real world images.

The process of generating simulated images for training a CNN 200 can be iterative. For example, the CNN 200 can be tested to determine whether the CNN 200 can correctly label input images with >90% accuracy, i.e., does the label output by the CNN 200 match the label included in the ground truth for greater than 90% of the input images. Locating a labeled object with >90% accuracy can include determining a location for the object with less than 10% error with respect to the distance from the object to the sensor as determined by the ground truth data. For example, a CNN 200 can be expected to label and locate objects such as motorcycles in all object configurations and noise factors available in a test dataset with 90% accuracy. If 90% accuracy is not achieved by the CNN 200 following retraining with the second training dataset, a third training dataset can be generated corresponding to the object configurations and noise factors that were unsuccessfully labeled and located by the CNN 200 after retraining. The CNN 200 can be again retrained using the first, second, and third datasets and the CNN 200 then re-tested using a test dataset that includes real world image data and real world ground truth corresponding to the real world image data. This process can repeat until 90% accuracy is achieved by the CNN 200.
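
The iterative loop can be sketched as follows; `train_fn`, `test_fn`, and `generate_fn` are hypothetical stand-ins for the training, testing, and image-generation steps described above:

```python
def iterate_training(train_fn, test_fn, generate_fn, first_dataset,
                     target_accuracy=0.90, max_rounds=5):
    """Train, test against real-world ground truth, generate additional
    simulated/synthetic data for the failing configurations and noise
    factors, and retrain on the union of all datasets until the
    accuracy target is met (or a round limit is reached)."""
    datasets = [first_dataset]
    for _ in range(max_rounds):
        cnn = train_fn(datasets)               # train on all datasets so far
        accuracy, failures = test_fn(cnn)      # failures: missed configs/noise
        if accuracy >= target_accuracy:
            return cnn
        datasets.append(generate_fn(failures)) # new rendered + GAN images
    return cnn
```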

FIG. 8 is a diagram of a flowchart, described in relation to FIGS. 1-7, of a process for labeling and locating objects in images. Process 800 can be implemented by a processor of a computing device, taking as input information from sensors, executing commands, and outputting object information, for example. Process 800 includes multiple blocks that can be executed in the illustrated order. Process 800 could alternatively or additionally include fewer blocks or can include the blocks executed in different orders.

Process 800 begins at block 802, where a training dataset is analyzed to determine underrepresented object configurations and noise factors. For example, motorcycles 406 have been determined to be underrepresented in training datasets used to train CNNs 200 used to operate vehicles 110. Examples of underrepresented object configurations that can apply to motorcycles 406 include motorcycles 406 that are partially obscured by other vehicles or vehicle shadows. Other object configurations that can apply to motorcycles 406 include small images, where a small number of pixels correspond to the image of the motorcycle 406 due to distance, and images where the motorcycle 406 appears in an unexpected location. Underrepresented noise factors include images where the motorcycle 406 or the background is subject to motion blur due to motion of the motorcycle 406 or the camera acquiring the image.

At block 804 a computing device 115 generates simulated images 600 based on scene descriptions corresponding to underrepresented object configurations using game engine software. The scene descriptions include ground truth corresponding to the generated simulated images 600.

At block 806 the computing device 115 generates synthetic images 700 corresponding to underrepresented noise factors based on the simulated images 600. The underrepresented noise factors include motion blur, lighting, and environmental conditions such as rain, snow, fog, dust, etc. The synthetic images 700 including corresponding ground truth are added to a training dataset for a CNN 200.

At block 808 computing device 115 trains a CNN 200 using the training dataset including the synthetic images 700 and corresponding ground truth. Because the training dataset includes images corresponding to underrepresented object configurations and underrepresented noise factors, the probability that the CNN 200 will correctly identify a motorcycle 406, for example, in data acquired by a camera in a vehicle 110 while the vehicle 110 is operating on a roadway is increased. The trained CNN 200 can be downloaded to a vehicle 110 and used to process acquired image data from a vehicle sensor 116. The CNN 200 can also be included in an edge computing node 170 and process data acquired by a camera 180 included in a traffic infrastructure system 100. Object label and location data output from the CNN 200 can be used by a vehicle 110 or a traffic infrastructure system 100 to determine a vehicle path upon which to operate a vehicle 110. For example, a computing device 115 in a vehicle or an edge computing node 170 can use the object label and object location data to determine a polynomial path that avoids contact with the object. Following block 808 process 800 ends.

Computing devices such as those discussed herein generally each include commands executable by one or more computing devices such as those identified above, and for carrying out blocks or steps of processes described above. For example, process blocks discussed above may be embodied as computer-executable commands.

Computer-executable commands may be compiled or interpreted from computer programs created using a variety of programming languages and/or technologies, including, without limitation, and either alone or in combination, Java™, C, C++, Python, Julia, SCALA, Visual Basic, JavaScript, Perl, HTML, etc. In general, a processor (e.g., a microprocessor) receives commands, e.g., from a memory, a computer-readable medium, etc., and executes these commands, thereby performing one or more processes, including one or more of the processes described herein. Such commands and other data may be stored in files and transmitted using a variety of computer-readable media. A file in a computing device is generally a collection of data stored on a computer readable medium, such as a storage medium, a random access memory, etc.

A computer-readable medium includes any medium that participates in providing data (e.g., commands), which may be read by a computer. Such a medium may take many forms, including, but not limited to, non-volatile media, volatile media, etc. Non-volatile media include, for example, optical or magnetic disks and other persistent memory. Volatile media include dynamic random access memory (DRAM), which typically constitutes a main memory. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EEPROM, any other memory chip or cartridge, or any other medium from which a computer can read.

All terms used in the claims are intended to be given their plain and ordinary meanings as understood by those skilled in the art unless an explicit indication to the contrary is made herein. In particular, use of the singular articles such as “a,” “the,” “said,” etc. should be read to recite one or more of the indicated elements unless a claim recites an explicit limitation to the contrary.

The term “exemplary” is used herein in the sense of signifying an example, e.g., a reference to an “exemplary widget” should be read as simply referring to an example of a widget.

The adverb “approximately” modifying a value or result means that a shape, structure, measurement, value, determination, calculation, etc. may deviate from an exactly described geometry, distance, measurement, value, determination, calculation, etc., because of imperfections in materials, machining, manufacturing, sensor measurements, computations, processing time, communications time, etc.

In the drawings, the same reference numbers indicate the same elements. Further, some or all of these elements could be changed. With regard to the media, processes, systems, methods, etc. described herein, it should be understood that, although the steps or blocks of such processes, etc. have been described as occurring according to a certain ordered sequence, such processes could be practiced with the described steps performed in an order other than the order described herein. It further should be understood that certain steps could be performed simultaneously, that other steps could be added, or that certain steps described herein could be omitted. In other words, the descriptions of processes herein are provided for the purpose of illustrating certain embodiments, and should in no way be construed so as to limit the claimed invention.

CLAIMS

1. A computer, comprising: a processor; and a memory, the memory including instructions executable by the processor to: determine a second convolutional neural network (CNN) training dataset by determining an underrepresented object configuration and an underrepresented noise factor corresponding to an object in a first CNN training dataset; generate one or more simulated images including the object corresponding to the underrepresented object configuration in the first CNN training dataset by inputting ground truth data corresponding to the object into a photorealistic rendering engine; generate one or more synthetic images including the object corresponding to the underrepresented noise factor in the first CNN training dataset by processing the simulated images with a generative adversarial network (GAN) to determine a second CNN training dataset; train the CNN using the first and the second CNN training datasets; and input an image acquired by a sensor to the trained CNN and output an object label and an object location corresponding to the object corresponding to the underrepresented object configuration and underrepresented object noise factor.

2. The computer of claim 1, the instructions including further instructions to determine a second underrepresented noise factor that corresponds to the CNN not outputting the object label and the object location corresponding to the object by testing the CNN with a test dataset based on real-world images; generate one or more synthetic images including the object corresponding to the second underrepresented noise factor to determine a third CNN training dataset by inputting images from the first training dataset and the second training dataset into the GAN; and retrain the CNN using the third training dataset to output a second object label and a second object location corresponding to the second underrepresented noise factor using the first, second and third CNN training datasets.

3. The computer of claim 1, wherein the first object configuration is underrepresented when a number of images in the first CNN training dataset that include the first object configuration is less than an average number of images that include each other object configuration.

4. The computer of claim 1, wherein the first noise factor is underrepresented when a number of images in the first CNN training dataset that include the first noise factor is less than an average number of images that include each other noise factor.

5. The computer of claim 1, wherein the first and second CNN training datasets include images that include the object and corresponding ground truth that includes the object label and the object location for the object included in the images.
6. The computer of claim 1, wherein training a second CNN with the second CNN training dataset reduces false positives output by a first CNN trained with the first CNN training dataset when processing real-world data, wherein a false positive is an object label incorrectly applied to an object occurring in an image.

7. The computer of claim 1, wherein the object label is a text string that identifies an object included in an input image and the object location is a bounding box corresponding to the object included in the input image.

8. The computer of claim 1, wherein object configurations include values and an arrangement of pixels corresponding to the object based on an object location, an object orientation, and partial obscuring of the object by another object and/or another object's shadow.

9. The computer of claim 1, wherein noise factors include values and an arrangement of pixels corresponding to the object based on environmental conditions including partial or full sunlight, precipitation including rain or snow, fog, and dust.

10. The computer of claim 9, wherein the photorealistic rendering engine is a software program that inputs ground truth data regarding object configuration and outputs an image of a traffic scene that appears as if it were acquired by a real-world camera.
11. The computer of claim 1, wherein the GAN is trained to generate synthetic images by training the GAN with real-world images that include noise factors.

12. The computer of claim 11, wherein training the GAN with real-world images includes comparing an image output by the GAN with input real-world images to determine similarity between the image output by the GAN and the input real-world images by correlating the output images with the input real-world images.
13. The computer of claim 1, wherein the CNN includes convolutional layers that output hidden variables to fully-connected layers that output states that include the object label and object location.

14. The computer of claim 1, wherein the CNN is trained to output states corresponding to the object label and object location by processing an input image a plurality of times and comparing the output states to ground truth data corresponding to the input image.

15. The computer of claim 1, the instructions including further instructions to operate a vehicle based on the object label and object location output from the CNN.

16. The computer of claim 15, wherein operating the vehicle includes determining a vehicle path that avoids contact with the object.

17. A method, comprising: determining a second convolutional neural network (CNN) training dataset by determining an underrepresented object configuration and an underrepresented noise factor corresponding to an object in a first CNN training dataset; generating one or more simulated images including the object corresponding to the underrepresented object configuration in the first CNN training dataset by inputting ground truth data corresponding to the object into a photorealistic rendering engine; generating one or more synthetic images including the object corresponding to the underrepresented noise factor in the first CNN training dataset by processing the simulated images with a generative adversarial network (GAN) to determine a second CNN training dataset; training the CNN using the first and the second CNN training datasets; and inputting an image acquired by a sensor to the trained CNN and outputting an object label and an object location corresponding to the underrepresented object configuration and underrepresented object noise factor.
18. The method of claim 17, further comprising: determining a second underrepresented noise factor that corresponds to the CNN not outputting the object label and the object location corresponding to the object by testing the CNN with a test dataset based on real-world images; generating one or more synthetic images including the object corresponding to the second underrepresented noise factor to determine a third CNN training dataset by inputting images from the first CNN training dataset and the second CNN training dataset into the GAN; and retraining the CNN using the third CNN training dataset to output a second object label and a second object location corresponding to the second underrepresented object noise factor using the first, second, and third CNN training datasets.
19. The method of claim 17, wherein a first object configuration is underrepresented when a number of images in the first CNN training dataset that include the first object configuration is less than an average number of images that include each other object configuration.

20. The method of claim 17, wherein a first noise factor is underrepresented when a number of images in the first CNN training dataset that include the first noise factor is less than an average number of images that include each other noise factor.